A Part of Speech Estimation Method for Japanese Unknown Words using a Statistical Model of Morphology and Context
Abstract
We present a statistical model of Japanese unknown words consisting of a set of length and spelling models classified by the character types that constitute a word. The point is quite simple: different character sets should be treated differently, and the changes between character types are very important, because Japanese script has both ideograms like Chinese (kanji) and phonograms like English (katakana). The proposed model improves both word segmentation accuracy and part of speech tagging accuracy. The model can achieve 96.6% tagging accuracy if unknown words are correctly segmented.

1 Introduction

In Japanese, around 95% word segmentation accuracy is reported using a word-based language model and Viterbi-like dynamic programming procedures (Nagata, 1994; Yamamoto, 1996; Takeuchi and Matsumoto, 1997; Haruno and Matsumoto, 1997). About the same accuracy is reported for Chinese using statistical methods (Sproat et al., 1996). But there has been relatively little improvement in recent years because most of the remaining errors are due to unknown words.

There are two approaches to solving this problem: increasing the coverage of the dictionary (Fung and Wu, 1994; Chang et al., 1995; Mori and Nagao, 1996) and designing a better model for unknown words (Nagata, 1996; Sproat et al., 1996). We take the latter approach. To improve word segmentation accuracy, Nagata (1996) used a single general-purpose unknown word model, while Sproat et al. (1996) used a set of specific word models, such as for plurals, personal names, and transliterated foreign words.

The goal of our research is to assign the correct part of speech to an unknown word as well as to identify it correctly. In this paper, we present a novel statistical model for Japanese unknown words. It consists of a set of word models for each part of speech and word type. We classified Japanese words into nine orthographic types based on the character types that constitute a word.
We find that by making different models for each word type, we can better model the length and spelling of unknown words.

In the following sections, we first describe the language model used for Japanese word segmentation. We then describe a series of unknown word models, from the baseline model to the one we propose. Finally, we demonstrate the effectiveness of the proposed model by experiment.

2 Word Segmentation Model

2.1 Baseline Language Model and Search Algorithm

Let the input Japanese character sequence be $C = c_1 \ldots c_m$, and segment it into the word sequence $W = w_1 \ldots w_n$. The word segmentation task can be defined as finding the word segmentation $\hat{W}$ that maximizes the probability of the word sequence given the character sequence, $P(W \mid C)$. Since the maximization is carried out with a fixed character sequence $C$, the word segmenter only has to maximize the joint probability of the word sequence, $P(W)$:

$$\hat{W} = \arg\max_{W} P(W \mid C) = \arg\max_{W} P(W) \qquad (1)$$

We call $P(W)$ the segmentation model. We can use any type of word-based language model for $P(W)$, such as word n-gram and class-based n-gram. We used the word bigram model in this paper. So, $P(W)$ is approximated by the product of word bigram probabilities $P(w_i \mid w_{i-1})$.
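The search in Equation (1) with a word bigram model can be illustrated with a small dynamic programming sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `segment`, the toy `VOCAB` and `BIGRAM_LOGP` tables, the `floor` penalty for unseen bigrams, and every probability value are hypothetical. It handles only words in the vocabulary, so it does not include any unknown word model.

```python
import math

BOS, EOS = "<s>", "</s>"  # hypothetical sentence-boundary symbols

def segment(text, vocab, bigram_logp, max_len=4, floor=-20.0):
    """Return the word sequence maximizing the sum of log P(w_i | w_{i-1}).

    `floor` is an assumed log-probability for bigrams missing from the table.
    Returns None when the text cannot be covered by vocabulary words.
    """
    n = len(text)
    # lattice[pos][word] = (best score of a path whose last word ends at pos,
    #                       back-pointer (previous position, previous word))
    lattice = [dict() for _ in range(n + 1)]
    lattice[0][BOS] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word not in vocab:
                continue
            for prev, (score, _) in lattice[start].items():
                cand = score + bigram_logp.get((prev, word), floor)
                if word not in lattice[end] or cand > lattice[end][word][0]:
                    lattice[end][word] = (cand, (start, prev))
    # Pick the best path that reaches the end of the text, including P(EOS | w_n).
    best_score, best_state = -math.inf, None
    for prev, (score, _) in lattice[n].items():
        cand = score + bigram_logp.get((prev, EOS), floor)
        if cand > best_score:
            best_score, best_state = cand, (n, prev)
    if best_state is None:
        return None
    # Follow back-pointers from the last word to the start symbol.
    words = []
    pos, word = best_state
    while word != BOS:
        words.append(word)
        pos, word = lattice[pos][word][1]
    return words[::-1]

# Toy data: all probabilities below are invented for illustration only.
VOCAB = {"東京", "都", "東", "京都", "に", "住む"}
BIGRAM_LOGP = {
    (BOS, "東京"): math.log(0.5), ("東京", "都"): math.log(0.6),
    (BOS, "東"): math.log(0.1), ("東", "京都"): math.log(0.1),
    ("都", "に"): math.log(0.5), ("京都", "に"): math.log(0.5),
    ("に", "住む"): math.log(0.5), ("住む", EOS): math.log(0.9),
}

print(segment("東京都に住む", VOCAB, BIGRAM_LOGP))
```

With these toy values the path 東京 / 都 / に / 住む scores higher than 東 / 京都 / に / 住む, so the first segmentation is returned. Working in log space turns the product of bigram probabilities into a sum and avoids numerical underflow on long inputs.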
Similar Resources
Part-of-Speech Tagging of the Persian Language Using a Fuzzy Network Model
Part of speech tagging (POS tagging) is an ongoing research topic in natural language processing (NLP) applications. The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...
A new model for Persian multi-part words edition based on statistical machine translation
Multi-part words in the English language are hyphenated, and the hyphen is used to separate the different parts. The Persian language has multi-part words as well. Based on Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use the space character instead of the half-space character. This common incorrect use of space leads to some s...
Using Context-based Statistical Models to Promote the Quality of Voice Conversion Systems
This article aims to examine methods of optimizing GMM-based voice conversion systems performance in which GMM method is introduced as the basic method for improvement of voice conversion systems performance. In the current methods, due to using a single conversion function to convert all speech units and subsequent spectral smoothing arising from statistical averaging, we will observe quality ...
Allophone-based acoustic modeling for Persian phoneme recognition
Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighbor phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...